Understanding Data Chunking

Demo for Edu Pilot

Lina M. Estupinan-Suarez, Felix Cremer, Fabian Gans

Max Planck Institute for Biogeochemistry

2023-04-28

Contents

  • Chunking overview
  • Data exploration
  • Statistical computation
  • Final remarks

Introduction

Task

Compute the mean and median per spatial pixel for the air_temperature_2m variable (source: ERA5) without loading the whole data set into memory (7 GB uncompressed file).

  • Input: air_temperature_2m with two different chunking configurations:

    • t2_map.nc: This chunked file setting aims at fast access to spatial layers i.e., grids in latitude and longitude (map chunking).

    • t2_blocks.nc: This chuncked file setting aims at an intermediate access to both the spatial and temporal dimensions (box chunking).

Data exploration

Let’s launch Julia and explore the data

# load libraries
using NetCDF # for loading of NetCDF data
using DiskArrays: eachchunk # for exploring the chunking of the data
using Statistics # for statistical analyses 

# download data
filebase = artifact"example_nc"

# load files metadata
xmap0 = NetCDF.open(joinpath(filebase,"t2m_map.nc"))
xbox0 = NetCDF.open(joinpath(filebase,"t2m_blocks.nc"));

# load data 
xmap = xmap0["air_temperature_2m"];
xbox = xbox0["air_temperature_2m"];

Visual inspection

Global map

include("plots4chunking.jl")
geoplotsfx(xmap[:,:,1])

Visual inspection

Time series

timeseriesfx(xmap[720,360,1519:end], 20, 30) 

Chunks overview

Different chunking settings for the same data set.

# with the 'eachchunk' command we visualize
# the index range of each chunk
chunkmap = eachchunk(xmap) # spatial chunk
sizemap = first(chunkmap)

chunkbox = eachchunk(xbox); # box chunk
sizebox = first(chunkbox)

print("The indices for the first spatial chunk are ", sizemap)
println("\nThe indices for the first box chunk are ", sizebox)
The indices for the first spatial chunk are (1:1440, 1:720, 1:1)
The indices for the first box chunk are (1:90, 1:90, 1:256)

Spatial and box chunking

Spatial chunking

spatialchunkfx(chunk1)


Box chunking

boxchunkfx(chunk2)


Data access performance

Spatial chunking

# spatial access (one map layer)
@time xmap[:,:,1];

# temporal access (time series)
@time xmap[1,1,:];
  0.020544 seconds (43 allocations: 3.957 MiB)
 29.568397 seconds (207.55 k allocations: 11.461 MiB, 0.17% compilation time: 88% of which was recompilation)

Box chunking

# spatial access (one map layer)
@time xbox[:,:,1];

# temporal access (time series)
@time xbox[1,1,:];
  4.328347 seconds (43 allocations: 3.957 MiB)
  0.276869 seconds (41 allocations: 9.453 KiB)

Take home message (1)

  • Time required to access geospatial data and time series varies depending on the characteristics of the chunks in the dataset.
  • Spatial chunking can access spatial layers about a hundred times faster than time series.
  • The box chunking offers a good compromise when analyses are required across all axes.

Statistical computation

Mean and Median have different computation requierments

Mean per pixel

using Statistics
@time xmeanmap = mean(xmap, dims=3);
@time xmeanbox = mean(xbox,dims=3);
 33.714522 seconds (780.54 k allocations: 7.685 GiB, 0.55% gc time, 0.69% compilation time: 100% of which was recompilation)
 35.278500 seconds (50.26 k allocations: 7.646 GiB, 0.15% gc time)
geoplotsfx(xmeanmap[:,:])

Statistical computation

Mean per time step

@time tmeanmap = mean(xmap, dims=(1,2));
@time tmeanbox = mean(xbox,dims=(1,2));
 32.294519 seconds (158.10 k allocations: 7.647 GiB, 0.22% gc time, 0.14% compilation time: 100% of which was recompilation)
 35.587795 seconds (50.26 k allocations: 7.642 GiB, 0.19% gc time)
timeseriesfx(tmeanmap[1519:end], 0, 10)

Take home message (2)

  • The computation time of the mean is similar regardless of the chunking properties and used axes.
  • This is due to the fact that the mean is a cumulative operation.
  • The mean computation is properly handled by DiskArrays.jl.

Statistical computation

Median by pixel

Spatial chunking

# subset spatial chunking
sub1 = view(xmap,1:2, 1:2,:)
out1 = zeros(size(sub1,1),size(sub1,2));

# subset box chunking
sub2 = view(xbox,1:2, 1:2,:)
out2 = zeros(size(sub2,1),size(sub2,2))

# Note: this is a very unefficient looping through the dataset!!!
@time for ilat in axes(sub1,2), ilon in axes(sub1,1)
    out1[ilon,ilat] = median(sub1[ilon,ilat,:])
end

@time for ilat in axes(sub2,2), ilon in axes(sub2,1)
    out2[ilon,ilat] = median(sub2[ilon,ilat,:])
end
119.032849 seconds (848.62 k allocations: 49.349 MiB, 0.21% compilation time: 98% of which was recompilation)
  1.075986 seconds (386 allocations: 77.078 KiB)

Warning:

This is a very unefficient looping through the dataset!

Statistical computation

Median by pixel

Another approach is to read the data by blocks

latsteps = 90
latranges = [(i*90-latsteps+1):(i*90) for i in 1:(720÷latsteps)];

And then compute the median for each block one after the other

out3 = zeros(size(xmap,1),size(xmap,2))
@time for ilat in latranges
    out3[:,ilat] = median(xmap[:,ilat,:],dims=3)
end

out4 = zeros(size(xbox,1),size(xbox,2));

@time for ilat in latranges
    out4[:,ilat] = median(xbox[:,ilat,:],dims=3)
end
290.432223 seconds (14.96 M allocations: 8.013 GiB, 0.20% gc time, 0.14% compilation time: 100% of which was recompilation)
 85.406716 seconds (14.08 M allocations: 7.962 GiB, 0.68% gc time)

Final remarks

  • Chunking is critical for efficient data access when the entire data set cannot be loaded into memory.
  • Calculations such as the mean are not affected by chunking if an appropriate library and code are used.
  • To ensure optimal performance for all operations an appropriate chunking size should be chosen considering the type of analyses and required dimensions.

Thank you for watching!



Repository: github.com/linamaes/chunking_tutorial